IBM HR Analytics Employee Attrition and Performance Dataset


In this study, we analyze the IBM HR Analytics Employee Attrition & Performance dataset, which is available on Kaggle [1]. The data is fictional and was created by IBM data scientists.

Standardized Dataset

Problem Description

In the dataset, Attrition indicates whether an employee has left the company. We would like to build a model that predicts this binary target from the remaining features.

X and y sets
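
The split into features and target described in the problem statement can be sketched as follows. This is a minimal illustration, assuming pandas and a small made-up frame: the rows and the choice of columns are hypothetical stand-ins, and only the `Attrition` column with its "Yes"/"No" values is taken from the dataset.

```python
import pandas as pd

# Made-up stand-in for the HR table (hypothetical rows, not real data).
df = pd.DataFrame({
    "Age": [30, 45, 28, 52],
    "MonthlyIncome": [3000, 7000, 2500, 9000],
    "OverTime": ["Yes", "No", "Yes", "No"],
    "Attrition": ["Yes", "No", "Yes", "No"],
})

# Target: 1 if the employee left, 0 otherwise.
y = (df["Attrition"] == "Yes").astype(int)

# Features: every column except the target, with categorical
# columns one-hot encoded so that a numeric model can use them.
X = pd.get_dummies(df.drop(columns="Attrition"), drop_first=True)

print(X.columns.tolist())
print(y.tolist())
```

One-hot encoding is only one possible way to handle the categorical columns; it is used here as an assumption, not as the encoding applied in this study.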

Correlation of the features

Training and testing sets

StratifiedKFold is a variation of k-fold which returns stratified folds: each set contains approximately the same percentage of samples of each target class as the complete set.
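
The effect of stratification can be checked on a small example. The sketch below assumes scikit-learn and NumPy and uses a hypothetical target with 20% positives; the number of splits and the random seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced target: 20 positives out of 100 samples.
y = np.array([1] * 20 + [0] * 80)
X = np.arange(len(y)).reshape(-1, 1)  # placeholder feature matrix

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

fold_rates = []
for fold, (train_idx, test_idx) in enumerate(skf.split(X, y), start=1):
    # Share of positive samples in this test fold.
    rate = y[test_idx].mean()
    fold_rates.append(rate)
    print(f"fold {fold}: test size = {len(test_idx)}, positive rate = {rate:.2f}")
```

Each test fold keeps the 20% positive rate of the full set, which a plain k-fold split does not guarantee.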

Modeling: Keras Multi-layer Perceptron (MLP) for Binary classification

A multi-layer perceptron (MLP) is a class of feedforward artificial neural network (ANN). At each training iteration, the algorithm measures the error with the cross-entropy loss, computes the gradient of that loss with respect to the model weights, and updates the weights accordingly. As this process repeats, the loss decreases, so the predictions should agree more closely with the true labels than they did at the first step.
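
The loss, gradient, and update cycle can be illustrated without any deep learning library. The sketch below is a deliberate simplification: it trains a single sigmoid unit (not a full MLP, and not the Keras model used in this study) with plain gradient descent on the binary cross-entropy loss. The toy data, learning rate, and number of epochs are assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-12):
    """Mean binary cross-entropy between labels and predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(y_true)

# Hypothetical toy data: one feature, binary label.
xs = [0.5, 1.0, 1.5, 3.0, 3.5, 4.0]
ys = [0, 0, 0, 1, 1, 1]

w, b = 0.0, 0.0
learning_rate = 0.1
losses = []

for epoch in range(500):
    # Forward pass: predicted probabilities and current loss.
    probs = [sigmoid(w * x + b) for x in xs]
    losses.append(binary_cross_entropy(ys, probs))
    # Gradient of the mean cross-entropy for a sigmoid output.
    grad_w = sum((p - y) * x for p, y, x in zip(probs, ys, xs)) / len(xs)
    grad_b = sum(p - y for p, y in zip(probs, ys)) / len(xs)
    # Gradient descent update.
    w -= learning_rate * grad_w
    b -= learning_rate * grad_b

print(f"first loss = {losses[0]:.4f}, last loss = {losses[-1]:.4f}")
```

An MLP repeats the same cycle, except that the gradient is propagated through several layers and the update is usually delegated to an optimizer.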

We monitor the model with the accuracy and recall scores.

Confusion Matrix

The confusion matrix allows for visualization of the performance of an algorithm.

The metrics that we use here to measure performance are derived from the confusion matrix, \begin{align} \text{Confusion Matrix} = \begin{bmatrix}T_p & F_p\\ F_n & T_n\end{bmatrix}, \end{align}

where $T_p$, $T_n$, $F_p$, and $F_n$ represent true positive, true negative, false positive, and false negative, respectively.
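
As a small illustration, the four counts can be tallied directly from label lists and arranged in the layout shown above. The helper name `confusion_counts` and the labels are hypothetical, and the sketch assumes that 1 marks the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 is the positive class."""
    tp = fp = fn = tn = 0
    for t, p in zip(y_true, y_pred):
        if t == 1 and p == 1:
            tp += 1
        elif t == 0 and p == 1:
            fp += 1
        elif t == 1 and p == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

# Hypothetical labels: 3 positives and 7 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
matrix = [[tp, fp], [fn, tn]]  # same arrangement as the equation above
print(matrix)  # [[2, 1], [1, 6]]
```

Note that `sklearn.metrics.confusion_matrix` orders rows by true label and columns by predicted label, so for labels 0 and 1 it returns the counts arranged as [[tn, fp], [fn, tp]] rather than in the layout used here.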

\begin{align} \text{Precision} &= \frac{T_{p}}{T_{p} + F_{p}},\\ \text{Recall} &= \frac{T_{p}}{T_{p} + F_{n}},\\ \text{F1} &= \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}},\\ \text{Balanced-Accuracy (bACC)} &= \frac{1}{2}\left( \frac{T_{p}}{T_{p} + F_{n}} + \frac{T_{n}}{T_{n} + F_{p}}\right). \end{align}
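
The formulas above translate directly into code. The sketch below uses a hypothetical helper `classification_metrics` and made-up counts, and it assumes that every denominator is non-zero.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and balanced accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)     # true negative rate
    f1 = 2 * precision * recall / (precision + recall)
    bacc = 0.5 * (recall + specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "bacc": bacc}

# Made-up counts for illustration.
metrics = classification_metrics(tp=30, fp=10, fn=20, tn=140)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these counts, the precision is 30/40 = 0.75, the recall is 30/50 = 0.60, the F1 score is 0.9/1.35 ≈ 0.6667, and the balanced accuracy is (0.60 + 140/150)/2 ≈ 0.7667.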

Accuracy can be a misleading metric for imbalanced data sets. In such cases, the balanced accuracy (bACC) [6] is recommended: it normalizes the true positive and true negative predictions by the number of positive and negative samples, respectively, and averages the two rates.
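
A short example makes the difference concrete. The sketch below assumes a hypothetical set of 100 samples with 16 positives and a trivial classifier that always predicts the negative class.

```python
# Hypothetical imbalanced labels: 16 positives, 84 negatives.
y_true = [1] * 16 + [0] * 84
# Trivial classifier: always predicts the majority (negative) class.
y_pred = [0] * len(y_true)

n_pos = sum(1 for t in y_true if t == 1)
n_neg = len(y_true) - n_pos

tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)

accuracy = (tp + tn) / len(y_true)
balanced_accuracy = 0.5 * (tp / n_pos + tn / n_neg)

print(f"accuracy = {accuracy:.2f}")                    # accuracy = 0.84
print(f"balanced accuracy = {balanced_accuracy:.2f}")  # balanced accuracy = 0.50
```

The classifier never identifies a positive sample, yet its accuracy is 0.84. The balanced accuracy of 0.50 shows that it does no better than chance.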


References

  1. Kaggle Dataset: IBM HR Analytics Employee Attrition & Performance
  2. scikit-learn: classifiers
  3. scikit-learn: Metrics and scoring: quantifying the quality of predictions
  4. Confusion matrix
  5. The Sequential model
  6. Mower, Jeffrey P. "PREP-Mt: predictive RNA editor for plant mitochondrial genes." BMC bioinformatics 6.1 (2005): 1-15.